Goto

Collaborating Authors

 internal covariate shift





Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Sergey Ioffe

Neural Information Processing Systems

Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training mini-batches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d.


905056c1ac1dad141560467e0a99e1cf-Paper.pdf

Neural Information Processing Systems

Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood.


Rethinking Layer-wise Model Merging through Chain of Merges

Buzzega, Pietro, Salami, Riccardo, Porrello, Angelo, Calderara, Simone

arXiv.org Artificial Intelligence

Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized models increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques operate at the level of individual layers, thereby overlooking the inter-layer dependencies inherent in deep networks. We show that this simplification leads to distributional mismatches, particularly in methods that rely on intermediate activations, as changes in early layers are not properly propagated to downstream layers during merging. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural networks training. To address this, we propose Chain of Merges (CoM), a layer-wise merging procedure that sequentially merges weights across layers while sequentially updating activation statistics. By explicitly accounting for inter-layer interactions, CoM mitigates covariate shift and produces a coherent merged model through a series of conditionally optimal updates. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.


Mask-PINNs: Mitigating Internal Covariate Shift in Physics-Informed Neural Networks

Jiang, Feilong, Hou, Xiaonan, Ye, Jianqiao, Xia, Min

arXiv.org Artificial Intelligence

Physics-Informed Neural Networks (PINNs) have emerged as a powerful framework for solving partial differential equations (PDEs) by embedding physical laws directly into the loss function. However, as a fundamental optimization issue, internal covariate shift (ICS) hinders the stable and effective training of PINNs by disrupting feature distributions and limiting model expressiveness. Unlike standard deep learning tasks, conventional remedies for ICS -- such as Batch Normalization and Layer Normalization -- are not directly applicable to PINNs, as they distort the physical consistency required for reliable PDE solutions. To address this issue, we propose Mask-PINNs, a novel architecture that introduces a learnable mask function to regulate feature distributions while preserving the underlying physical constraints of PINNs. We provide a theoretical analysis showing that the mask suppresses the expansion of feature representations through a carefully designed modulation mechanism. Empirically, we validate the method on multiple PDE benchmarks -- including convection, wave propagation, and Helmholtz equations -- across diverse activation functions. Our results show consistent improvements in prediction accuracy, convergence stability, and robustness. Furthermore, we demonstrate that Mask-PINNs enable the effective use of wider networks, overcoming a key limitation in existing PINN frameworks.



Reviews: Revisit Fuzzy Neural Network: Demystifying Batch Normalization and ReLU with Generalized Hamming Network

Neural Information Processing Systems

The authors use a notion of generalized hamming distance, to shed light on the success of Batch normalization and ReLU units. After reading the paper, I am still very confused about its contribution. The authors claim that generalized hamming distance offers a better view of batch normalization and relus, and explain that in two paragraphs in pages 4,5. The explanation for batch normalization is essentially contained in the following phrase: "It turns out BN is indeed attempting to compensate for deficiencies in neuron outputs with respect to GHD. This surprising observation indeed adheres to our conjecture that an optimized neuron should faithfully measure the GHD between inputs and weights."


Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Sergey Ioffe

Neural Information Processing Systems

Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d.